Summary:
Analysis log for clinal fundulus project
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
# edge definitions with the node IDs
tab1 -> tab2 -> tab3 -> tab4 -> tab5 -> tab6 -> tab7
tab6 -> tab8
}
[1]: 'library prep (Elshire GBS)'
[2]: 'sequencing (illumina hiseq2000)'
[3]: 'qc checks and data wrangling (fastqc and bash)'
[4]: 'demultiplex and read filtering (stacks: process_radtags)'
[5]: 'read mapping (bwa_mem)'
[6]: 'genotype likelihoods (ANGSD)'
[7]: 'demography and population structure'
[8]: 'selection'
")
The clinal dataset is composed of 9 lanes of HiSeq2500 run in SE100. These reads include 1478 individuals from 5 separate experiments plus an additional set of lanes used for undersampled regions (CT, NC, SC, and ME).
Lane names:
H7T6MADXX_1 H7T6MADXX_2 H9AREADXX_1 H9AREADXX_2 H9CWFADXX_1 H9WYGADXX_1 H9WYGADXX_2 HMMKLADXX_1 HMMKLADXX_2
Samples
sites<-read.table("./sites/sites.txt", header = T, sep = "\t")
kable(sites)
| Population_Full_Name | Abbreviation | N | Longitude | Latitude |
|---|---|---|---|---|
| Brayton Point, MA | BP | 50 | -71.18604 | 41.71250 |
| Clinton, CT | CT | 24 | -72.52448 | 41.27902 |
| Sapelo Island Ferry Dock, Brunswick, GA | GA | 41 | -81.35452 | 31.44932 |
| Horseneck Beach, MA | HB | 50 | -71.02556 | 41.50449 |
| Kings Creek, Severn, VA | KC | 25 | -76.42462 | 37.29845 |
| Mattapoisett, MA | M | 31 | -70.81464 | 41.65875 |
| Mount Desert Island Bio Lab | MDIBL | 24 | -68.29441 | 44.42901 |
| Wiscassett, ME | ME | 27 | -69.66481 | 43.99695 |
| Mantoloking, NJ | MG | 331 | -74.07045 | 40.05046 |
| New Bedford Harbor | NBH, F, H, P, Syc | 150 | -70.90760 | 41.62486 |
| Manteo, NC | NC | 24 | -75.66737 | 35.90029 |
| Oyster Creek, NJ | OC | 47 | -74.18475 | 39.80866 |
| Rutgers, Tuckerton, NJ | RB | 423 | -74.32448 | 39.50878 |
| Charleston, SC | SC | 26 | -79.91624 | 32.75873 |
| Stone Harbor | SH | 62 | -74.76492 | 39.05666 |
| Slocum River, MA | SLC | 30 | -70.97109 | 41.52257 |
| Succotash Marsh, Matunuck, RI | SR | 49 | -71.52557 | 41.38235 |
| Wilmington, NC (Carolina Beach State Park) | WNC | 30 | -77.91932 | 34.04998 |
| grandis | grandis | 34 | -84.17740 | 30.07200 |
#!/bin/bash
# set the number of nodes
#SBATCH --nodes=1
# set the number of cpus
#SBATCH --cpus-per-task=6
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-02:00:00
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=fastqc
# set name of output file
#SBATCH --output=fastqc.out
# mail alert at start, end and abortion of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
files="H7T6MADXX_1
H7T6MADXX_2
H9AREADXX_1
H9AREADXX_2
H9CWFADXX_1
H9WYGADXX_1
H9WYGADXX_2
HMMKLADXX_1
HMMKLADXX_2
"
for file in $files
do
/home/jgibbons/SOFTWARE/FastQC/fastqc -t 4 /home/ddayan/fundulus/seqdata/${file}_fastq.gz -o /home/ddayan/fundulus/seqdata/fastqc
done
fastqc looks good
We will use ANGSD to create population-level allele frequency estimates rather than call individual genotypes. The input files for ANGSD are individual-level BAM files. Therefore the first step after the QC check is to demultiplex, remove low-quality reads, and trim adapter and barcode sequences. This is accomplished with the process_radtags program from stacks v2.3.
Barcode keys (the files used to demultiplex sequence data) are a bit complex given that some fish are sequenced twice and stacks requires each lane to be run separately through process_radtags. The solution is to give duplicate fish a unique ID suffix but a shared prefix; the cleaned, demultiplexed reads from each fish in each lane are then combined with their additional reads using prefix matching.
make new fish_ids based only on population and library prep ID, then mark duplicates (duplicate fish_ids occur because some individual fish were sequenced twice, i.e. in different lanes)
barcodes<-read.table("./snp_calling/barcode_keys/cline_barcode_key.txt", sep = "\t", header = T)
tmp<-paste(barcodes$Pop , "_", barcodes$LibraryPrepID, sep = "")
barcodes$fish_id<-make.names(tmp, unique = TRUE)
write.csv(barcodes, row.names = FALSE, file = "./snp_calling/barcode_keys/cline_barcode_key.csv", quote = FALSE)
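The prefix-matching merge described above can be sketched as follows (a minimal illustration with hypothetical file names; the actual merge was run on the server):

```python
import re
from collections import defaultdict

def group_by_prefix(filenames):
    """Group demultiplexed read files by fish-id prefix.

    make.names() gives duplicate fish a suffix like '.1', so
    'BP_101.fq.gz' and 'BP_101.1.fq.gz' share the prefix 'BP_101'.
    """
    groups = defaultdict(list)
    for f in filenames:
        stem = f.replace('.fq.gz', '')
        # strip the make.names uniqueness suffix (a trailing .<digits>)
        prefix = re.sub(r'\.\d+$', '', stem)
        groups[prefix].append(f)
    return dict(groups)

files = ['BP_101.fq.gz', 'BP_101.1.fq.gz', 'CT_7.fq.gz']
print(group_by_prefix(files))
```

Each group can then be concatenated into a single per-fish read file.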
used a python script on hand for a slightly different format: create a single unique variable for sequencing lane (i.e. column-bind flowcell and lane with a "_"), then run the python script below
""" The input file has four columns, this script takes writes columns 2 and 3 (barcode and individual) to a new file based on the value of column 4."""
import csv
with open('./snp_calling/barcode_keys/cline_barcode_key.csv') as fin:
csvin = csv.DictReader(fin)
# Category -> open file lookup
outputs = {}
for row in csvin:
cat = row['Flowcell_Lane']
# Open a new file and write the header
if cat not in outputs:
fout = open('./snp_calling/barcode_keys/{}_key.csv'.format(cat), 'w')
dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
dw.writeheader()
outputs[cat] = fout, dw
# Always write the row
outputs[cat][1].writerow(row)
# Close all the files
for fout, _ in outputs.values():
fout.close()
oops, only meant to keep barcode and sample ID, and need to write to a tab-delimited file
for i in ./snp_calling/barcode_keys/*csv
do
cut -d "," -f 2,10 $i > ${i%.csv}.tmp
done
for i in ./snp_calling/barcode_keys/*tmp
do
tr "," "\\t" < $i > ${i%.tmp}_barcodes.txt
done
clean up directory
rm ./snp_calling/barcode_keys/*.tmp
rm ./snp_calling/barcode_keys/*[12]_key.csv
remove first line from all files
sed -i -e "1d" ./meta_data/*_barcodes.txt
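For reference, the three shell steps above (cut columns 2 and 10, comma to tab, drop the header) could be collapsed into one pass per file; a sketch with made-up rows (column indices 1-based, as in cut):

```python
import csv

def make_barcode_key(in_lines, cols=(2, 10)):
    """Return tab-delimited barcode/sample lines from a comma-delimited
    key file, skipping the header (cols are 1-based, as in cut)."""
    rows = csv.reader(in_lines)
    next(rows)                      # drop the header line
    out = []
    for row in rows:
        out.append('\t'.join(row[c - 1] for c in cols))
    return out

lines = ['barcode_full,barcode,c3,c4,c5,c6,c7,c8,c9,fish_id',
         'x,ACGT,.,.,.,.,.,.,.,BP_1',
         'y,TTAA,.,.,.,.,.,.,.,BP_2']
print(make_barcode_key(lines))
```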
Summary
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=bwa_default
# set name of output file
#SBATCH --output=bwadefault.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
for i in `find ../cleaned_tags/ -name '*.gz'`
do
/opt/bio-bwa/bwa mem -t 38 ../genome/f_het_genome_3.0.2.fa.gz ../cleaned/${i} | /opt/samtools-1.6/bin/samtools view -@ 16 -bSu - | /opt/samtools-1.6/bin/samtools sort -@ 16 - -o ./${i:0:-6}.$
done
What are the depths of mapped reads (from bam files)?
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=1
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bam_depths
# set name of output file
#SBATCH --output=bam_depths
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in *bam
do
$samtools depth $i | awk -v var="$i" '{sum += $3; n++} END {print " '${i%.bam}' ", sum / n}' >> depths.txt
done
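The awk one-liner above just averages column 3 of `samtools depth` output (chrom, pos, depth at each covered position); the same computation as a sketch, with toy lines:

```python
def mean_depth(depth_lines):
    """Mean of column 3 from samtools depth output (chrom, pos, depth)."""
    depths = [int(line.split()[2]) for line in depth_lines]
    return sum(depths) / len(depths)

lines = ['chr1\t100\t5', 'chr1\t101\t7', 'chr1\t102\t9']
print(mean_depth(lines))  # 7.0
```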
bam_depths<-read.table("./snp_calling/QC/depths.txt", sep = " ")
bam_depths<-bam_depths[,-c(1,3)]
colnames(bam_depths)<-c("fish_id", "depth")
ggplot(data = bam_depths)+geom_density(aes(x = depth))+labs(title = "mean depth at mapped reads (bam) over all samples", x = "depth", y = "frequency")
#we could also easily split up the depth file here to look at population level variation in depth
let’s also look at the overall number of mapped reads per sample
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=1
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bam_mapped_reads
# set name of output file
#SBATCH --output=bam_mapped_reads.out
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in *bam
do
{ $samtools view -c -F 4 $i; echo ${i%.bam} ; } | tr "\n" " " >> mapped_reads.txt
done
mapped_reads<-read.table("./snp_calling/QC/mapped_reads.txt", sep = " ")
ggplot(data = mapped_reads)+geom_histogram(aes(x = log10(mapped_reads[,1])))+labs(title = " mapped reads in bam files", x = "log10 reads")
grandis
keeping grandis in the analysis may be useful to reconstruct the ancestral state of the genome and therefore create an unfolded SFS, but how well do the grandis reads align to the fundulus reference genome? let’s compare the bam files:
require(stringr)
colnames(mapped_reads) <- c("reads", "sample")
mapped_reads$pop<-str_extract(mapped_reads$sample, "[^_]+")
mapped_reads %>%
group_by(pop) %>%
summarize(mean = mean(reads), sd = sd(reads), n = n())
ggplot()+geom_density(data = mapped_reads[(mapped_reads$pop!="grandis"),],aes(x=reads))+geom_density(data = mapped_reads[(mapped_reads$pop=="grandis"),],aes(x=reads), color = "red")+labs(title = "total number mapped reads", subtitle = "red - grandis ; black - heteroclitus")
this isn’t ideal though; let’s try to figure out the PROPORTION of reads mapped to the reference genome
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=1
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bam_mapped_reads
# set name of output file
#SBATCH --output=bam_mapped_reads.out
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in *bam
do
{ $samtools view -c -F 0 $i; echo ${i%.bam} ; } | tr "\n" " " >> total_reads.txt
done
#total_reads <- read.table("./snp_calling/QC/total_reads.txt", sep = " ")
# make column of total reads, then proportion reads, plot again
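The planned proportion is just mapped/total per sample after joining the two count files; a sketch with made-up counts (`-F 4` counts mapped reads, `-F 0` counts all reads):

```python
def prop_mapped(mapped, total):
    """Fraction of reads mapped per sample.

    mapped: counts from samtools view -c -F 4
    total:  counts from samtools view -c -F 0
    """
    return {s: mapped[s] / total[s] for s in mapped}

mapped = {'BP_1': 900000, 'grandis_1': 400000}
total = {'BP_1': 1000000, 'grandis_1': 1000000}
print(prop_mapped(mapped, total))  # {'BP_1': 0.9, 'grandis_1': 0.4}
```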
ANGSD Analysis outline
We conduct two sets of ANGSD analyses. The first calculates genotype likelihoods. The second estimates the site frequency spectrum.
grViz("digraph flowchart {
# node definitions with substituted label text for placing code
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
tab9 [label = '@@9']
# edge definitions with the node IDs
tab1 -> tab3;
tab1 -> tab2;
tab3 -> tab4 -> tab5
tab4 -> tab6
tab3 -> tab7
tab3 -> tab8
tab9 -> tab1
}
[1]: 'ANGSD -GL 1 -doMaf -doMajorMinor -glf 2'
[2]: 'Allele Frequencies'
[3]: 'beagle format GLs'
[4]: 'covariance matrix (PCangsd)'
[5]: 'PCA (R)'
[6]: 'RDA/other multivariate EAAs (R)'
[7]: 'LD (ngsLD)'
[8]: 'Admixture (ngsAdmix)'
[9]: 'bam files'
")
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
# edge definitions with the node IDs
tab1 -> tab2
tab2 -> tab3
tab3 -> tab4 -> tab5
tab3 -> tab6
tab3 -> tab7
tab8 -> tab1
}
[1]: 'ANGSD -GL 1 -doSAF unfolded'
[2]: 'SAF'
[3]: 'SFS (realSFS)'
[4]: 'Dadi conversion '
[5]: 'moments/Dadi'
[6]: 'theta, tajD'
[7]: 'FST'
[8]: 'bam files'
")
Create lists of bams for all the separate populations
#in server directory containing bams
# for each population create a list of paths to bam files
find "$(cd ..; pwd)" -iname "BP*bam" > ../BP.bams
find "$(cd ..; pwd)" -iname "CT*bam" > ../CT.bams
find "$(cd ..; pwd)" -iname "GA*bam" > ../GA.bams
find "$(cd ..; pwd)" -iname "HB*bam" > ../HB.bams
find "$(cd ..; pwd)" -iname "KC*bam" > ../KC.bams
find "$(cd ..; pwd)" -iname "M_*bam" > ../M.bams
find "$(cd ..; pwd)" -iname "MDIBL_*bam" > ../MDIBL.bams
find "$(cd ..; pwd)" -iname "ME_*bam" > ../ME.bams
find "$(cd ..; pwd)" -iname "MG_*bam" > ../MG.bams
find "$(cd ..; pwd)" -iname "NBH_*bam" > ../NBH.bams
find "$(cd ..; pwd)" -iname "F_*bam" > ../F.bams
find "$(cd ..; pwd)" -iname "H_*bam" > ../H.bams
find "$(cd ..; pwd)" -iname "P_*bam" > ../P.bams
find "$(cd ..; pwd)" -iname "SYC_*bam" > ../SYC.bams
find "$(cd ..; pwd)" -iname "NC_*bam" > ../NC.bams
find "$(cd ..; pwd)" -iname "OC_*bam" > ../OC.bams
find "$(cd ..; pwd)" -iname "RB_*bam" > ../RB.bams
find "$(cd ..; pwd)" -iname "SC_*bam" > ../SC.bams
find "$(cd ..; pwd)" -iname "SH_*bam" > ../SH.bams
find "$(cd ..; pwd)" -iname "SLC_*bam" > ../SLC.bams
find "$(cd ..; pwd)" -iname "SR_*bam" > ../SR.bams
find "$(cd ..; pwd)" -iname "WNC_*bam" > ../WNC.bams
find "$(cd ..; pwd)" -iname "grandis_*bam" > ../grandis.bams
cat *bams > ALL.bamlist
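The 23 nearly identical find commands above could also be generated from a list of population prefixes (a sketch; the trailing underscore distinguishes e.g. M_ from MDIBL_ and ME_, exactly as in the hand-written commands):

```python
# population prefixes as used in the find commands above
pops = ['BP', 'CT', 'GA', 'HB', 'KC', 'M_', 'MDIBL_', 'ME_', 'MG_', 'NBH_',
        'F_', 'H_', 'P_', 'SYC_', 'NC_', 'OC_', 'RB_', 'SC_', 'SH_',
        'SLC_', 'SR_', 'WNC_', 'grandis_']

def find_cmd(prefix):
    """Build one find command; output file name drops the underscore."""
    pop = prefix.rstrip('_')
    return 'find "$(cd ..; pwd)" -iname "{}*bam" > ../{}.bams'.format(prefix, pop)

for p in pops:
    print(find_cmd(p))
```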
first compress the genome and create a compression index
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bgzip_index
# set name of output file
#SBATCH --output=bgzip.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
/home/ddayan/software/htslib-1.9/bgzip -i -@ 10 f_het_genome_3.0.2.fa
then create a samtools index
samtools faidx f_het_genome_3.0.2.fa.gz
also need to index all the bam files
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=bgzip_index
# set name of output file
#SBATCH --output=bgzip.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
source /opt/samtools-1.6/source_me
samtools=/opt/samtools-1.6/bin/samtools
for i in ./*.bam;do $samtools index $i;done
First let’s take a look at coverage and allele frequency to get an idea of what the data look like prior to filtering
rationale for options:
command options
-P 10: use 10 threads
-ref: path to reference genome
-out: output
-r : run on only one contig (to check speed)
filtering
-uniqueOnly: use only uniquely mapping reads
-remove_bads: discard bad reads
-trim 0: perform no trimming
-C: exclude reads with excessive mismatches
-baq: avoid SNPs called due to indels
-minMapQ: exclude reads with poor mapping
-maxDepth: read depths above this level are binned (set high to look at the actual depth distribution)
dos
-doQsDist: calculates the distribution of quality scores of used reads
-doDepth: calculates the distribution of read depths
-doCounts: needed for the depth calculation
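As noted for -maxDepth above, sites deeper than the cutoff all land in the last histogram bin; schematically:

```python
def depth_histogram(depths, max_depth):
    """Counts of sites at each depth 0..max_depth; deeper sites are
    binned at max_depth (as ANGSD does with -maxDepth)."""
    counts = [0] * (max_depth + 1)
    for d in depths:
        counts[min(d, max_depth)] += 1
    return counts

print(depth_histogram([0, 2, 2, 5, 999], max_depth=4))  # [1, 0, 2, 0, 2]
```

This is why a high -maxDepth is needed to see the real right tail of the distribution.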
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=bwa_default
# set name of output file
#SBATCH --output=bwadefault.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
$ANGSD -P 10 -b ../meta_data/ALL.bamlist -ref $REF -out ./Results/ALL.qc \
-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -maxDepth 1000 \
-minMapQ 20 -doQsDist 1 -doDepth 1 -doCounts 1
results of initial ANGSD run: Q scores and global depth
## barplot q-scores
qs <- read.table(file = "./snp_calling/ANGSD_outputs/qc_check/ALL.qc.qs", head=T, stringsAsFactors=F)
barplot(height=qs$counts, names.arg=qs$qscore, xlab="Q-score", ylab="Counts", main = "Q-score distribution")
## global depth
dep <- as.numeric(scan(file = "./snp_calling/ANGSD_outputs/qc_check/ALL.qc.depthGlobal",what="char", quiet=T))
dep_freq<-as.data.frame(cbind(dep, seq(1:length(dep))))
colnames(dep_freq)<-c("count", "depth")
dep_freq$depth<-dep_freq$depth-1
ggplot(data = dep_freq, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth", title = "global depths (summed across all fish)")
ggplot(data = dep_freq, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth", title = "subset of plot above")+xlim(0, 100)
Depths for a handful of individuals:
## individual depth
idep<-read.table("./snp_calling/ANGSD_outputs/qc_check/ALL.qc.depthSample")
#look at read depth histograms for individuals 1, 200, and 505
i1<-as.data.frame(t(idep[1,]))
i1$depth<-0:(nrow(i1)-1)
colnames(i1)<-c("log10count", "depth")
i1$log10count <- log10(i1$log10count)
ggplot(data = i1, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")
i200<-as.data.frame(t(idep[200,]))
i200$depth<-0:(nrow(i200)-1)
colnames(i200)<-c("log10count", "depth")
i200$log10count <- log10(i200$log10count)
ggplot(data = i200, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")
i505<-as.data.frame(t(idep[505,]))
i505$depth<-0:(nrow(i505)-1)
colnames(i505)<-c("log10count", "depth")
i505$log10count <- log10(i505$log10count)
ggplot(data = i505, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")
other salient results:
Total number of sites analyzed: 567547
Number of sites retained after filtering: 559561
thoughts here:
- most sites have very limited depth; with 1480 samples, the global depth is largely less than 1000
- 59000 sites (about 10%) have more than 1000 reads; are these driven by moderate depth in a lot of individuals or extremely high depth in just a few?
- let’s run this again but only retain sites present in 80% of individuals
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=bwa_default
# set name of output file
#SBATCH --output=bwadefault.out
# send mail to this address
#SBATCH --mail-user=ddayan@clarku.edu
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
$ANGSD -P 10 -b ../meta_data/ALL.bamlist -ref $REF -out ./Results/ALL.qc -r KN811289.1 \
-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -maxDepth 1000 -minInd 1184 \
-minMapQ 20 -doQsDist 1 -doDepth 1 -doCounts 1
no sites were retained with this level of filtering; let’s scale up to the full genome and see how many sites with coverage in 80% of individuals we get: 2702 sites …
this ANGSD output is very sparse. how many loci are these sites spread across? how many reads are they based on before and after filtering? what are the mapping qualities? why is the depth data presented in the least convenient format imaginable? I miss stacks and tassel…
let’s try again but a little less stringent; here we retain sites with coverage in 70% of the Fundulus heteroclitus individuals in the dataset (1010)
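For the record, the -minInd cutoffs used in these runs follow from the sample counts; a quick sanity check (assuming ~1480 bams total, as mentioned above, and the 34 grandis from the sites table):

```python
# sanity check of the -minInd thresholds used in the ANGSD runs
n_bams = 1480        # total samples (the log cites both 1478 and 1480)
n_grandis = 34       # grandis count from the sites table

minind_80 = int(0.8 * n_bams)              # 80% of all individuals
minind_70 = int(0.7 * (1478 - n_grandis))  # 70% of F. heteroclitus only

print(minind_80, minind_70)  # 1184 1010
```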
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=0-23:59:00
#SBATCH --cpus-per-task=10
# set partition/queue to use
#SBATCH --partition=day-long-cpu
# set name of job
#SBATCH --job-name=angsd_0.7
# set name of output file
#SBATCH --output=angsd_0.7.out
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
$ANGSD -P 10 -b ../meta_data/ALL.bamlist -ref $REF -out ./Results/ALL.qc -r KN811289.1 \
-uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -minInd 1010 \
-minMapQ 20
how many loci is this over? we can use gatk countloci for this
prep:
GATK="/opt/java/jdk1.8.0_121/bin/java -jar /opt/Gatk/GenomeAnalysisTK.jar"
#create dictionary for gatk
java -jar /opt/picard-tools-1.119/CreateSequenceDictionary.jar R= fundulus/genome/f_het_genome_3.0.2.fa O= fundulus/genome/f_het_genome_3.0.2.dict
#create an index for the uncompressed genome file (gatk only run on decompressed reference genomes)
samtools faidx f_het_genome_3.0.2.fa
#now we can finally try to run GATK
$GATK -T CountLoci -R f_het_genome_3.0.2.fa -I ../aligned/OC_421.merged.bam
####agh this doesn't work because read groups were not retained in the bam files...
Here we’re going to finally run ANGSD
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=angsd_GL_0.1
# set name of output file
#SBATCH --output=angsd_GL_0.1.out
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
FILTERS="-minInd 740 -uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -skipTriallelic 1 -SNP_pval 1e-6 -minMapQ 20 -maxDepth 14800"
DOS="-GL 1 -doMajorMinor 1 -doMaf 2 -doGlf 2 -doCounts 1 -doDepth 1"
$ANGSD -P 38 -b ../meta_data/ALL.bamlist -ref $REF -out ./Results/ALL.qc \
$FILTERS $DOS
Total number of sites analyzed: 80148766
Number of sites retained after filtering: 62571
Coverage
## global depth
dep <- as.numeric(scan(file = "./snp_calling/ANGSD_outputs/run_0.1//ALL.qc.depthGlobal",what="char", quiet=T))
dep_freq<-as.data.frame(cbind(dep, seq(1:length(dep))))
colnames(dep_freq)<-c("count", "depth")
dep_freq$depth<-dep_freq$depth-1
dep_freq$log_count<-log10(dep_freq$count)
ggplot(data = dep_freq, aes(depth, log_count))+geom_bar(stat = "identity")+labs(x = "Depth", title = "global depths (summed across all fish)")
## Warning: Removed 9208 rows containing missing values (geom_bar).
Of the 62571 sites, 47984 (76%) have more than 14800 reads and show up as a single thin bar at 14800 in this plot
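That 76% figure is just the mass in the final depthGlobal bin (everything at or above -maxDepth is binned there); with a toy count vector:

```python
def frac_in_last_bin(counts):
    """Fraction of sites whose depth hit the -maxDepth ceiling."""
    return counts[-1] / sum(counts)

print(frac_in_last_bin([10, 20, 10, 120]))  # 0.75
```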
## individual depth
idep<-read.table("./snp_calling/ANGSD_outputs/run_0.1/ALL.qc.depthSample")
#look at read depth histograms for a few individuals
#i100<-as.data.frame(t(idep[1,]))
#i100$depth<-c(0:14799, by = 1)
#colnames(i100)<-c("log10count", "depth")
#i100$log10count <- log10(i100$log10count)
#ggplot(data = i100, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,500))
i100<-as.data.frame(t(idep[100,]))
i100$depth<-0:(nrow(i100)-1)
i100$log10count <- log10(i100$`100`)
colnames(i100)<-c("count", "depth", "log10count")
#ggplot(data = i100, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,500))
ggplot(data = i100, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,400))+ylim(c(0,4000))+labs(title = "individual 100")
## Warning: Removed 14400 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_bar).
i1000<-as.data.frame(t(idep[1000,]))
i1000$depth<-0:(nrow(i1000)-1)
i1000$log10count <- log10(i1000$`1000`)
colnames(i1000)<-c("count", "depth", "log10count")
#ggplot(data = i1000, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,1000))
ggplot(data = i1000, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,400))+ylim(c(0,2000))+labs(title = "individual 1000")
## Warning: Removed 14400 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_bar).
i500<-as.data.frame(t(idep[500,]))
i500$depth<-0:(nrow(i500)-1)
i500$log10count <- log10(i500$`500`)
colnames(i500)<-c("count", "depth", "log10count")
#ggplot(data = i500, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,500))
ggplot(data = i500, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,300))+ylim(c(0,3000))+labs(title = "individual 500")
## Warning: Removed 14500 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_bar).
i148<-as.data.frame(t(idep[148,]))
i148$depth<-0:(nrow(i148)-1)
i148$log10count <- log10(i148$`148`)
colnames(i148)<-c("count", "depth", "log10count")
#ggplot(data = i148, aes(depth, log10count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,148))
ggplot(data = i148, aes(depth, count))+geom_bar(stat = "identity")+labs(x = "Depth")+xlim(c(0,300))+ylim(c(0,3000))+labs(title = "grandis individual")
## Warning: Removed 14500 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_bar).
if we look at depth distributions for a handful of individual fish (reads per locus per individual) we observe a nice distribution with mostly >20x coverage
a note on installation (code chunk below):
#didn't want to wait for HPC to install pcangsd so did it locally. the problem (as always) is that pcangsd has dependencies that get installed to the local python, but slurm calls a different python. used the tool virtualenv to get around this
#create the virtual environment in the directory containing pcagnsd
virtualenv software/pcangsd/
cd software/pcangsd/
#activate it
source bin/activate
#install dependencies and pcangsd
pip install cython
pip install numpy
python setup.py build_ext --inplace
pip install -r requirements.txt
#check install
python pcangsd.py
#deactivate the virtualenv
#when running in batch script make sure to source this virtual env before executing the actual command
deactivate
ok now let’s try a PCA
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=pcangsd_0.1
# set name of output file
#SBATCH --output=pcangsd_0.1.out
source /home/ddayan/software/pcangsd/bin/activate
python /home/ddayan/software/pcangsd/pcangsd.py -beagle ../Results/ALL_gl_0.1.beagle.gz -o pcangsd.gl_0.1 -threads 38
Let’s look at these PCA results
cov_mat<-read.table("./pcangsd/pcangsd.gl_0.1.cov", sep = " ")
filenames = scan("./snp_calling/ANGSD_outputs/ALL.bamlist", what="character")
fish_names = gsub("/home/ddayan/fundulus/aligned/", "", gsub(".merged", "", gsub(".bam", "", filenames)))
pc1<-prcomp(cov_mat)
pc_inds<-as.data.frame(pc1$x)
row.names(pc_inds)<-fish_names
pc_inds$pop<-gsub("_\\d+", "", fish_names)
#caution: this drops all NBH sub-populations except NBH, fix later
pc_inds <- merge(pc_inds, sites[,c("Abbreviation", "Latitude")], by.x = "pop" , by.y ="Abbreviation")
ggplot(data = pc_inds) + geom_point(aes(x = PC1, y = PC2, color = Latitude))+scale_color_gradientn(colours = rainbow(5))#+stat_ellipse(aes(x = PC1, y = PC2, group = pop),type = "norm")
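Note that prcomp() above treats the covariance matrix as a data matrix; pcangsd output is more commonly decomposed directly with eigen(). A pure-Python power-iteration sketch of the leading eigenvector, on a toy symmetric matrix (a stand-in for eigen()/numpy.linalg.eigh):

```python
def top_pc(cov, iters=200):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    n = len(cov)
    v = [1.0] * n
    for _ in range(iters):
        # multiply cov by v, then normalize
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# toy 2x2 covariance matrix; leading eigenvector is (1, 1)/sqrt(2)
cov = [[2.0, 1.0], [1.0, 2.0]]
print(top_pc(cov))
```

Individuals' PC scores are then their loadings on these eigenvectors.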
plot_ly(x=pc_inds$PC1, y=pc_inds$PC2, z=pc_inds$PC3, type="scatter3d", mode="markers", color=pc_inds$pop)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
hover the mouse over points for population ID; population codes are as in the table “sites” under the section “data description” at the top of the document
it looks like there are some QC issues to take on (or a grandis individual is more similar to f. heteroclitus than the other grandis), but this approach is at least working well enough to differentiate pops in PC space
grandis
we will use the grandis reads to create a grandis consensus sequence (-doFasta in ANGSD), which will be used as the ancestral state for unfolded SFS estimation. we will not estimate genotype likelihoods in grandis, as this would influence the SNP probabilities, and it seems that including grandis dominates the structure found by PCA
HWE
no HWE filtering; we filter out paralogues using excess coverage instead. the rationale is that HWE is likely to be skewed given the amount of population structure in the sample, and filtering by HWE stratified by population is not possible in the ANGSD pipeline
MAF
MAF filtering will skew the SFS and should not be applied to methods that rely on the SFS (demography inference, FST). The inclusion of singletons in called-genotype datasets can confound STRUCTURE (with little effect on multivariate population structure algorithms, e.g. PCA, DAPC). My assumption is that using genotype likelihoods instead of called genotypes should reduce the impact of singletons or low-frequency alleles (erroneous or true) on the inference of structure using model-based approaches like STRUCTURE, while the inclusion of rare alleles will improve discrimination among recently diverged populations (or populations with rare gene flow) when using multivariate approaches
max_coverage
we will set max coverage at 3x the modal coverage per locus; this means we will have to run ANGSD again with a higher maxDepth cutoff for the coverage output in order to estimate it…
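The planned cutoff (3x the modal per-locus coverage) can be read straight off a depth histogram; a sketch with a toy count vector:

```python
def max_depth_cutoff(depth_counts, factor=3):
    """3x the modal depth; depth_counts[d] = number of sites with depth d.
    Depth 0 is ignored when finding the mode."""
    mode = max(range(1, len(depth_counts)), key=lambda d: depth_counts[d])
    return factor * mode

print(max_depth_cutoff([999, 5, 40, 80, 20]))  # mode is depth 3 -> cutoff 9
```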
min_coverage_locus
no hard minimum coverage cutoff; instead use a stringent SNP_pval cutoff and a minimum number of individuals (we require that a site be present in 50% of samples: 686)
minimum_coverage_ind
the mean coverage for the individuals (at the bam level) is ~730k reads, but a handful of individuals have very low coverage (see bam stats section). we will remove these individuals BEFORE the ANGSD analysis by editing the bam lists, removing individuals in the lowest 5% of the coverage distribution
snp_pval
impose a stringent probability cutoff for retaining a SNP: 1e-6
quality_and_mapping
-uniqueOnly 1 -remove_bads 1 -C 50 -baq 1 -skipTriallelic 1 -minMapQ 20
first make a bamlist
#get rid of individuals with low sequencing coverage
bad_inds <- mapped_reads[mapped_reads$reads < quantile(mapped_reads$reads, 0.05), "sample"]
write.table(bad_inds, file = "./snp_calling/QC/bad_inds.txt", quote = FALSE, row.names = FALSE)
grep -v -f bad_inds.txt ALL.bamlist | grep -v 'grandis' > good_no_grandis.bamlist
This removed all grandis and 72 additional individuals
#!/bin/bash
# set max wall-clock time (D-HH:MM:SS)
#SBATCH --time=6-23:59:00
#SBATCH --cpus-per-task=38
# set partition/queue to use
#SBATCH --partition=week-long-cpu
# set name of job
#SBATCH --job-name=angsd_GL_1.0
# set name of output file
#SBATCH --output=angsd_GL_1.0.out
ANGSD="/opt/angsd0.928/angsd/angsd"
REF="/home/ddayan/fundulus/genome/f_het_genome_3.0.2.fa.gz"
FILTERS="-minInd 686 -uniqueOnly 1 -remove_bads 1 -trim 0 -C 50 -baq 1 -skipTriallelic 1 -SNP_pval 1e-6 -minMapQ 20"
DOS="-GL 1 -doMajorMinor 1 -doMaf 2 -doGlf 2 -doCounts 1 -dumpCounts 2"
$ANGSD -P 38 -b ../meta_data/good_no_grandis.bamlist -ref $REF -out ./Results/gl_1.0 \
$FILTERS $DOS
extract coverage info and plot
choose max depth and filter for final dataset